Abstract

Sparsity-inducing penalties are useful tools to design multiclass support vector machines 
(SVMs). In this paper, we propose a convex optimization approach for efficiently and exactly 
solving the multiclass SVM learning problem involving a sparse regularization and the multiclass 
hinge loss formulated by [1]. We provide two algorithms: the first one dealing with the hinge
loss as a penalty term, and the other one addressing the case when the hinge loss is enforced 
through a constraint. The related convex optimization problems can be efficiently solved thanks 
to the flexibility offered by recent primal-dual proximal algorithms and epigraphical splitting 
techniques. Experiments carried out on several datasets demonstrate the interest of considering 
the exact expression of the hinge loss rather than a smooth approximation. The efficiency of the 
proposed algorithms w.r.t. several state-of-the-art methods is also assessed through comparisons 
of execution times. 


1 Introduction 


Support vector machines (SVMs) have gained much popularity in solving large-scale classification
problems. As a matter of fact, many applications considered in the literature deal with a large
amount of training data or a huge (even infinite) number of classes [2, 3, 4, 5, 6]. Consequently,
the major difficulty encountered in this kind of application stems from the computational cost.
The SVM learning problem is classically solved by using standard Lagrangian duality techniques.
This approach brings in several advantages, such as the kernel trick [8], or the possibility
to break the problem down into a sequence of smaller ones. Some works also proposed to
approximate the dual problem using cutting-plane approaches, in order to address scenarios with
thousands or even an infinite number of classes.

*This work was supported by the CNRS IMAG'in OPTIMISME project.

†G. Chierchia (corresponding author) and B. Pesquet-Popescu are with Télécom ParisTech/Institut Télécom,
LTCI, UMR CNRS 5141, 75014 Paris, France (e-mail: first.last@telecom-paristech.fr).

‡N. Pustelnik is with the Laboratoire de Physique de l'ENS Lyon, CNRS UMR 5672, F69007 Lyon, France. Phone:
+33 4 72 72 86 49, E-mail: nelly.pustelnik@ens-lyon.fr.

§J.-C. Pesquet is with the Université Paris-Est, LIGM, CNRS-UMR 8049, 77454 Marne-la-Vallée Cedex 2, France.
Phone: +33 1 60 95 77 39, E-mail: jean-christophe.pesquet@univ-paris-est.fr.





In some applications, however, only a small number of training data is available. This is 
undoubtedly true in medical contexts, where the goal is to classify a patient as being “healthy”, 
“contaminated”, or “infected”, but the verified cases of infected patients might be just a few. In 
such applications, the lack of training data may lead to the so-called overfitting problem, eventually 
leading to a prediction which is too strongly tailored to the particularities of the training set and 
poorly generalizes to new data. 

A common solution to prevent overfitting consists of introducing a sparsity-inducing regularization
in order to perform an implicit feature selection that gets rid of irrelevant or noisy features. In
this respect, the $\ell_1$-norm and, more generally, the $\ell_{1,p}$-norm regularization have attracted much
attention over the past decade [11, 12, 13, 14, 15, 16, 17, 18]. However, when a sparse regularization
is introduced, the dual approach no longer yields a simple formulation. Therefore, SVMs
with sparse regularization lead to a nonsmooth convex optimization problem which is challenging.
The main objective of this paper is to exactly and efficiently solve the multiclass SVM learning
problem for convex regularizations. To this end, we propose two algorithms based on a primal-dual
proximal method [19, 20] and a novel epigraphical splitting technique [21]. In addition to more
detailed theoretical developments, this paper extends our preliminary work [22] by providing a
new algorithm, and a larger number of experiments including comparisons with state-of-the-art
methods for different types of databases.


1.1 Related work 

The use of sparse regularization in SVMs was first proposed in the context of binary classification.
The idea traces back to the work by [23], who demonstrated that the $\ell_1$-norm regularization can
effectively perform “feature selection” by shrinking small coefficients to zero. Other forms of
regularization have also been studied, such as the $\ell_0$-norm [24], the $\ell_p$-norm with $p > 0$ [25], the
$\ell_\infty$-norm [26], and the combination of $\ell_0$-$\ell_1$ norms [27] or $\ell_1$-$\ell_2$ norms [28]. A different solution
was proposed by [29], who reformulated the SVM learning problem by using an indicator vector
(its components being either equal to 0 or 1) to model the active features, and solved the resulting
combinatorial problem by convex relaxation using a cutting-plane algorithm. More recently, [30]
proposed an accelerated algorithm for $\ell_1$-regularized SVMs involving the squared hinge loss. They also
proposed a procedure for handling nonconvex regularization (using the reweighted $\ell_1$-minimization
scheme by [31]), showing that nonconvex penalties lead to similar prediction quality while using
fewer features than convex ones.

Binary SVMs can be turned into multiclass classifiers by a variety of strategies, such as the
one-vs-all approach [32]. While these techniques provide a simple and powerful framework, they
cannot capture the correlations between different classes, since they break a multiclass problem into
multiple independent binary problems. [1] therefore proposed a direct formulation of multiclass
SVMs by generalizing the notion of margins used in the binary case. A natural idea thus consists
of equipping multiclass SVMs with sparse regularization. A simple example is the $\ell_1$-regularized
multiclass SVM, which can be addressed by linear programming techniques [33]. In multiclass
problems however, feature selection becomes more complex than in the binary case, since multiple
discriminating functions need to be estimated, each one with its own set of important features.
For this reason, mixed-norm regularization has recently attracted much interest due to its ability
to impose group sparsity [34, 35, 36]. In the context of multiclass SVMs, [37] proposed to
deal with the $\ell_{1,\infty}$-norm regularization by reformulating the SVM learning problem in terms of
linear programming. However, they validated their method on small-size problems, indicating that
the linear reformulation may be inefficient for larger-size ones. More recently, [38] proposed an
algorithm to handle $\ell_{1,2}$-regularized SVMs involving a smooth loss function. While their method is
efficient and can handle other convex regularizations, it does not rigorously solve the multiclass
SVM learning problem, possibly leading to performance limitations.


1.2 Contributions 

The algorithmic solutions proposed in the literature to deal with sparse multiclass SVMs are either
cutting-plane methods [29], proximal algorithms [30, 38], or linear programming techniques [33, 37].
However, both cutting-plane methods and proximal algorithms have been employed to find an
approximate solution, while linear programming techniques may not scale well to large datasets. In
this paper, we propose a novel approach based on proximal tools and recent epigraphical splitting
techniques [21], which allow us to exactly solve the sparse multiclass SVM learning problem through
an efficient primal-dual proximal method [19, 20].


1.3 Outline 

The paper is organized as follows. In Section 2 we formulate the multiclass SVM problem with
sparse regularization, in Section 3 we provide the proximal tools needed to solve the proposed
problem, and in Section 4 we evaluate our approach on three standard datasets and compare it to
the method proposed by [38] and to other state-of-the-art approaches.


1.4 Notation 

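$\Gamma_0(\mathbb{R}^N)$ denotes the set of proper, lower semicontinuous, convex functions from the Euclidean space
$\mathbb{R}^N$ to $]-\infty, +\infty]$. The epigraph of $\psi \in \Gamma_0(\mathbb{R}^N)$ is the nonempty closed convex subset of $\mathbb{R}^N \times \mathbb{R}$
defined as $\operatorname{epi}\psi = \{(y,\zeta) \in \mathbb{R}^N \times \mathbb{R} \;|\; \psi(y) \le \zeta\}$. For every $x \in \mathbb{R}^N$, the subdifferential of $\psi$ at $x$
is $\partial\psi(x) = \{u \in \mathbb{R}^N \;|\; (\forall y \in \mathbb{R}^N)\; \langle y - x \mid u \rangle + \psi(x) \le \psi(y)\}$. Let $C$ be a nonempty closed convex
subset of $\mathbb{R}^N$, then $\iota_C$ is the indicator function of $C$, equal to $0$ on $C$ and $+\infty$ otherwise.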


2 Sparse Multiclass SVM 

A multiclass classifier can be modeled as a function $d\colon \mathbb{R}^N \to \{1,\dots,K\}$ that predicts the class
$k \in \{1,\dots,K\}$ associated to a given observation $u \in \mathbb{R}^N$ (e.g. a signal, an image or a graph).
This predictor relies on $K$ different discriminating functions $D_k\colon \mathbb{R}^N \to \mathbb{R}$ which, for every
$k \in \{1,\dots,K\}$, measure the likelihood that an observation belongs to the class $k$. Consequently,
the predictor selects the class that best matches an observation, i.e.

$$d(u) \in \operatorname*{argmax}_{k \in \{1,\dots,K\}} D_k(u).$$


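In supervised learning, the discriminating functions are built from a set of $L$ input-output pairs

$$\mathcal{S} = \big\{ (u_\ell, z_\ell) \in \mathbb{R}^N \times \{1,\dots,K\} \;|\; \ell \in \{1,\dots,L\} \big\},$$

and they are assumed to be linear in some feature representation of inputs. The latter
assumption leads to the following form of the discriminating functions:

$$D_k(u) = \phi(u)^\top w^{(k)} + b^{(k)}, \tag{1}$$

where $\phi\colon \mathbb{R}^N \to \mathbb{R}^M$ denotes a mapping from the input space onto an arbitrary feature space, and
$(w^{(k)})_{1\le k\le K}$ and $(b^{(k)})_{1\le k\le K}$ denote the parameters to be estimated. For convenience, we concatenate the latter
ones into a single vector $x \in \mathbb{R}^{(M+1)K}$,

$$x = \begin{bmatrix} x^{(1)} \\ \vdots \\ x^{(K)} \end{bmatrix}, \qquad x^{(k)} = \begin{bmatrix} w^{(k)} \\ b^{(k)} \end{bmatrix},$$

and we define the function $\varphi\colon \mathbb{R}^N \to \mathbb{R}^{M+1}$ as

$$\varphi(u) = \begin{bmatrix} \phi(u) \\ 1 \end{bmatrix},$$

so that (1) can be shortened to $D_k(u) = \varphi(u)^\top x^{(k)}$.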
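As an illustration of the linear predictor described above, the following minimal NumPy sketch evaluates the $K$ discriminating functions and picks the best class. It assumes the feature map has already been applied (each row of `U` is $\phi(u)$), uses 0-based class indices, and the names `predict`, `W` and `b` are hypothetical choices made for this example only.

```python
import numpy as np

def predict(U, W, b):
    """Predict a class (0-based) for each row of U.

    U : (n, M) array, rows are feature vectors phi(u) (assumed precomputed)
    W : (K, M) array, row k holds the weights of class k
    b : (K,)   array of offsets
    """
    scores = U @ W.T + b          # scores[i, k] = D_k(u_i)
    return np.argmax(scores, axis=1)
```

For instance, with `W = np.eye(3)` and zero offsets, an observation is assigned to the class of its largest feature.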


2.1 Background 

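The objective of learning consists of finding the vector $x$ such that, for every $\ell \in \{1,\dots,L\}$, the
input-output pair $(u_\ell, z_\ell) \in \mathcal{S}$ is correctly predicted by the classifier, i.e.,

$$z_\ell = \operatorname*{argmax}_{k \in \{1,\dots,K\}} \varphi(u_\ell)^\top x^{(k)}.$$

By the definition of argmax, the above equality holds if¹

$$(\forall \ell \in \{1,\dots,L\}) \qquad \max_{k \neq z_\ell} \varphi(u_\ell)^\top \big(x^{(k)} - x^{(z_\ell)}\big) < 0,$$

or, equivalently,

$$(\forall \ell \in \{1,\dots,L\}) \qquad \max_{k \neq z_\ell} \varphi(u_\ell)^\top \big(x^{(k)} - x^{(z_\ell)}\big) \le -\mu_\ell, \tag{2}$$

where, for every $\ell \in \{1,\dots,L\}$, $\mu_\ell$ is a positive scalar. Unfortunately, this constraint has no
practical interest for learning purposes, as it becomes infeasible when the training set is not fully
separable. Multiclass SVMs overcome this issue by introducing the notion of soft margins, which
consists of adding a vector of slack variables $\xi = (\xi^{(\ell)})_{1\le \ell\le L}$ into (2):

$$\begin{cases} (\forall \ell \in \{1,\dots,L\}) \quad \max_{k \neq z_\ell} \varphi(u_\ell)^\top \big(x^{(k)} - x^{(z_\ell)}\big) \le \xi^{(\ell)} - \mu_\ell, \\ (\forall \ell \in \{1,\dots,L\}) \quad \xi^{(\ell)} \ge 0. \end{cases} \tag{3}$$

¹To simplify the notation, we shorten $k \in \{1,\dots,K\} \setminus \{z_\ell\}$ to $k \neq z_\ell$.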


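The multiclass SVM learning problem is thus obtained by adding a quadratic regularization [1],
yielding

$$\underset{(x,\xi) \in \mathbb{R}^{(M+1)K} \times \mathbb{R}^L}{\operatorname{minimize}} \;\; \sum_{k=1}^{K} \|w^{(k)}\|_2^2 + \lambda \sum_{\ell=1}^{L} \xi^{(\ell)} \quad \text{s.t.} \quad \begin{cases} (\forall \ell \in \{1,\dots,L\}) \quad \max_{k \neq z_\ell} \varphi(u_\ell)^\top \big(x^{(k)} - x^{(z_\ell)}\big) \le \xi^{(\ell)} - \mu_\ell, \\ (\forall \ell \in \{1,\dots,L\}) \quad \xi^{(\ell)} \ge 0, \end{cases} \tag{4}$$

where $\lambda \in \,]0,+\infty[$ and $w^{(k)}$ collects the entries of $x^{(k)}$ other than the offset $b^{(k)}$. Note that the
linear penalty on the slack variables allows us to minimize the violation of constraint (2). By using
standard convex analysis [10], the above problem can be equivalently rewritten without slack variables as

$$\underset{x \in \mathbb{R}^{(M+1)K}}{\operatorname{minimize}} \;\; \sum_{k=1}^{K} \|w^{(k)}\|_2^2 + \lambda \sum_{\ell=1}^{L} \max\Big\{0,\; \mu_\ell + \max_{k \neq z_\ell} \varphi(u_\ell)^\top \big(x^{(k)} - x^{(z_\ell)}\big)\Big\}. \tag{5}$$

Hereabove, the second term is called hinge loss when $\mu_\ell = 1$.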


2.2 Proposed approach 


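We extend Problem (5) by replacing the squared $\ell_2$-norm regularization with a generic function
$g \in \Gamma_0(\mathbb{R}^{(M+1)K})$.² Moreover, we rewrite the hinge loss in an equivalent form by introducing, for
every $\ell \in \{1,\dots,L\}$, the linear operator $T_\ell\colon \mathbb{R}^{(M+1)K} \to \mathbb{R}^K$ defined as

$$(\forall x \in \mathbb{R}^{(M+1)K}) \qquad T_\ell\, x = \Big[\varphi(u_\ell)^\top \big(x^{(k)} - x^{(z_\ell)}\big)\Big]_{1\le k\le K},$$

the vector $r_\ell = (r_\ell^{(k)})_{1\le k\le K} \in \mathbb{R}^K$ defined as

$$(\forall k \in \{1,\dots,K\}) \qquad r_\ell^{(k)} = \begin{cases} 0, & \text{if } k = z_\ell, \\ \mu_\ell, & \text{otherwise,} \end{cases}$$

and the function $h_\ell\colon \mathbb{R}^K \to \mathbb{R}$ defined, for every $y^{(\ell)} = (y^{(\ell,k)})_{1\le k\le K} \in \mathbb{R}^K$, as

$$h_\ell(y^{(\ell)}) = \max_{1\le k\le K} \; y^{(\ell,k)} + r_\ell^{(k)}, \tag{6}$$

so that the following holds

$$h_\ell(T_\ell\, x) = \max\Big\{0,\; \mu_\ell + \max_{k \neq z_\ell} \varphi(u_\ell)^\top \big(x^{(k)} - x^{(z_\ell)}\big)\Big\}.$$

²Note that the regularization does not involve the offsets $(b^{(k)})_{1\le k\le K}$.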
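The equivalence between the hinge loss and its operator form $h_\ell(T_\ell x)$ can be checked numerically. The sketch below is only an illustration under stated assumptions: the rows of `Phi` are the augmented features $\varphi(u_\ell)$, `X` stores $x^{(k)}$ as its $k$-th row, labels are 0-based, a single margin `mu` is shared by all samples, and $K \ge 2$. The function names are hypothetical.

```python
import numpy as np

def hinge_direct(Phi, z, X, mu=1.0):
    """sum_l max{0, mu + max_{k != z_l} phi_l^T (x_k - x_{z_l})}."""
    L = Phi.shape[0]
    idx = np.arange(L)
    scores = Phi @ X.T                               # scores[l, k] = phi_l^T x_k
    margins = scores - scores[idx, z][:, None] + mu  # mu + phi_l^T (x_k - x_{z_l})
    margins[idx, z] = -np.inf                        # exclude k = z_l from the inner max
    return np.sum(np.maximum(0.0, margins.max(axis=1)))

def hinge_operator(Phi, z, X, mu=1.0):
    """Same quantity written as sum_l h_l(T_l x), with h_l(y) = max_k (y_k + r_{l,k})."""
    L, K = Phi.shape[0], X.shape[0]
    idx = np.arange(L)
    scores = Phi @ X.T
    TX = scores - scores[idx, z][:, None]            # row l is T_l x (zero at k = z_l)
    R = np.full((L, K), mu)
    R[idx, z] = 0.0                                  # r_l vanishes at the true class
    return np.sum((TX + R).max(axis=1))
```

The two functions return the same value because the entry $k = z_\ell$ of $T_\ell x + r_\ell$ is always zero, which plays the role of the outer $\max\{0, \cdot\}$.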

We aim at solving the following convex optimization problems:

• regularized formulation

$$\underset{x \in \mathbb{R}^{(M+1)K}}{\operatorname{minimize}} \;\; g(x) + \lambda \sum_{\ell=1}^{L} h_\ell(T_\ell\, x), \tag{7}$$

• constrained formulation

$$\underset{x \in \mathbb{R}^{(M+1)K}}{\operatorname{minimize}} \;\; g(x) \quad \text{s.t.} \quad \sum_{\ell=1}^{L} h_\ell(T_\ell\, x) \le \eta, \tag{8}$$

where $\lambda$ and $\eta$ are positive constants. Note that, by Lagrangian duality, the above formulations
are equivalent for some specific values of $\eta$ and $\lambda$. The interest of considering the constrained
formulation lies in the fact that $\eta$ may be easier to set, since it is directly related to the properties
of the training data.


As mentioned in the introduction, the regularization term $g$ is chosen so as to promote some
form of sparsity. A popular example is the $\ell_1$-norm, as it ensures that the solution will have a
number of coefficients exactly equal to zero, depending on the strength of the regularization.
Another example is given by the mixed $\ell_{1,p}$-norm. For every $x \in \mathbb{R}^{(M+1)K}$, let us assume that, for
each $k \in \{1,\dots,K\}$, the weight vector $w^{(k)} \in \mathbb{R}^M$ (the part of $x^{(k)}$ excluding the offset $b^{(k)}$) is
block-decomposed as follows:

$$w^{(k)} = \Big[\, \underbrace{(w^{(k,1)})^\top}_{\text{size } M_1} \;\; \dots \;\; \underbrace{(w^{(k,B)})^\top}_{\text{size } M_B} \,\Big]^\top$$

with $M_1 + \dots + M_B = M$. We define the $\ell_{1,p}$-norm as

$$\|x\|_{1,p} = \sum_{k=1}^{K} \sum_{b=1}^{B} \|w^{(k,b)}\|_p.$$

The mixed-norm regularization is known to induce block-sparsity: the solution is partitioned into
groups and the components of each group are ideally either all zeros or all non-zeros. In this context,
the exponent values $p = 2$ or $p = +\infty$ are the most popular choices. In particular, the $\ell_{1,\infty}$-norm
tends to favor solutions with few nonzero groups having components of similar magnitude.
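To make the mixed-norm definition concrete, here is a small NumPy sketch that evaluates the $\ell_{1,p}$-norm of a weight matrix. It is a simplified illustration: it assumes all blocks have the same size and are contiguous (the text allows arbitrary sizes $M_1, \dots, M_B$), and that the offsets have already been removed. The name `mixed_norm` is hypothetical.

```python
import numpy as np

def mixed_norm(W, block_size, p=2):
    """l_{1,p} norm of W: sum over classes and blocks of the block l_p norms.

    W          : (K, M) array, row k holds the weights of class k (no offset)
    block_size : common size of the contiguous blocks (assumed to divide M)
    p          : inner exponent, e.g. 2 or np.inf
    """
    K, M = W.shape
    assert M % block_size == 0, "equal-size blocks are assumed in this sketch"
    blocks = W.reshape(K, M // block_size, block_size)
    return np.sum(np.linalg.norm(blocks, ord=p, axis=2))
```

With `p=2` a block contributes its Euclidean length, whereas with `p=np.inf` it contributes its largest absolute entry, which is why the latter does not penalize the remaining entries of an active block.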


3 Optimization method 


The resolution of Problems (7) and (8) requires an efficient algorithm for dealing with nonsmooth
functions and hard constraints. In the convex optimization literature, proximal algorithms constitute
one of the most efficient approaches to deal with nonsmooth problems. The key
tool in these methods is the proximity operator [45], defined for a function $\psi \in \Gamma_0(\mathcal{H})$ as

$$(\forall u \in \mathcal{H}) \qquad \operatorname{prox}_\psi(u) = \operatorname*{argmin}_{v \in \mathcal{H}} \; \frac{1}{2}\|v - u\|^2 + \psi(v).$$

The proximity operator can be interpreted as a sort of implicit subgradient step for the function $\psi$, as
$p = \operatorname{prox}_\psi(y)$ is uniquely defined through the inclusion $y - p \in \partial\psi(p)$. In addition, it reverts to the
projection onto a closed convex set $C \subset \mathcal{H}$ in the case when $\psi = \iota_C$, in the sense that

$$(\forall u \in \mathcal{H}) \qquad \operatorname{prox}_{\iota_C}(u) = P_C(u) = \operatorname*{argmin}_{v \in C} \; \frac{1}{2}\|v - u\|^2. \tag{9}$$

Proximal algorithms work by iterating a sequence of steps in which the proximity operators of
the functions involved in the minimization are evaluated at each iteration. An efficient computation
of these operators is thus essential to design fast algorithms for solving Problems (7)-(8). In the next
sections, we will present two different approaches based on a Forward-Backward based Primal-Dual
method (FBPD) [19, 20, 46, 47, 48], which we have selected among the large panel of proximal
algorithms for its simplicity to deal with large-size linear operators.
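Two proximity operators that are known in closed form, and that are relevant to the sparse regularizers considered here, are soft-thresholding (for the $\ell_1$-norm) and block soft-thresholding (for the $\ell_{1,2}$-norm). The sketch below is illustrative only: the function names are hypothetical, `gamma` is the scaling of the function, and equal-size contiguous blocks are assumed.

```python
import numpy as np

def prox_l1(u, gamma):
    """prox of gamma * ||.||_1: entrywise soft-thresholding."""
    return np.sign(u) * np.maximum(np.abs(u) - gamma, 0.0)

def prox_l12(u, gamma, block_size):
    """prox of gamma * (sum of block l2 norms): block soft-thresholding.

    u is a 1-D array whose length is assumed to be a multiple of block_size.
    """
    blocks = u.reshape(-1, block_size)
    norms = np.linalg.norm(blocks, axis=1, keepdims=True)
    # Shrink each block towards zero; blocks with norm <= gamma vanish entirely.
    scale = np.maximum(1.0 - gamma / np.maximum(norms, 1e-300), 0.0)
    return (blocks * scale).reshape(u.shape)
```

For `p = prox_l1(y, gamma)`, the inclusion $y - p \in \gamma\,\partial\|\cdot\|_1(p)$ reads entrywise: $y_i - p_i = \gamma\,\mathrm{sign}(p_i)$ where $p_i \neq 0$, and $|y_i| \le \gamma$ where $p_i = 0$.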


3.1 Regularized formulation 


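Problem (7) fits nicely into the framework provided by the FBPD algorithm, since the proximity
operators of both $g$ and $(h_\ell)_{1\le \ell\le L}$ can be efficiently computed. Indeed, $\operatorname{prox}_g$ has a closed form
for several norms and mixed norms, while $(\operatorname{prox}_{h_\ell})_{1\le \ell\le L}$ can be computed through the
projection onto the standard simplex, as described in Proposition 3.1. The projection onto the
simplex can itself be computed efficiently.

Proposition 3.1. For every $\ell \in \{1,\dots,L\}$,

$$(\forall y^{(\ell)} \in \mathbb{R}^K) \qquad \operatorname{prox}_{\lambda h_\ell}(y^{(\ell)}) = y^{(\ell)} - P_{S_\lambda}\big(y^{(\ell)} + r_\ell\big), \tag{10}$$

with

$$S_\lambda = \Big\{ u = (u^{(k)})_{1\le k\le K} \in [0,+\infty[^K \;\Big|\; \sum_{k=1}^{K} u^{(k)} = \lambda \Big\}.$$

Proof. Note that $u \in \mathbb{R}^K \mapsto \lambda \max_{1\le k\le K} u^{(k)}$ is the support function of $S_\lambda$, defined as $(\forall u \in \mathbb{R}^K)$
$\sigma_{S_\lambda}(u) = \sup_{v \in S_\lambda} v^\top u$. Hence, for every $y^{(\ell)} \in \mathbb{R}^K$, $\lambda h_\ell(y^{(\ell)}) = \sigma_{S_\lambda}(y^{(\ell)} + r_\ell)$ and

$$\operatorname{prox}_{\lambda h_\ell}(y^{(\ell)}) = \operatorname{prox}_{\sigma_{S_\lambda}}\big(y^{(\ell)} + r_\ell\big) - r_\ell.$$

Since $\sigma_{S_\lambda}$ is the conjugate function of $\iota_{S_\lambda}$, (10) is deduced by applying Moreau's decomposition
formula [50, Theorem 14.3(ii)] and (9). □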
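Proposition 3.1 can be sketched in a few lines of NumPy. The simplex projection below uses a standard sort-based scheme, chosen here for brevity; it is an assumption of this example and not necessarily the method referenced in the text. The names `project_simplex` and `prox_hinge` are hypothetical.

```python
import numpy as np

def project_simplex(v, lam=1.0):
    """Euclidean projection of v onto S_lam = {u >= 0, sum(u) = lam} (sort-based)."""
    u = np.sort(v)[::-1]                        # entries in descending order
    css = np.cumsum(u) - lam                    # shifted cumulative sums
    ks = np.arange(1, len(v) + 1)
    rho = np.nonzero(u - css / ks > 0)[0][-1]   # last index kept in the support
    tau = css[rho] / (rho + 1)                  # common shift applied to v
    return np.maximum(v - tau, 0.0)

def prox_hinge(y, r, lam=1.0):
    """prox of lam * h, with h(y) = max_k (y_k + r_k), following (10)."""
    return y - project_simplex(y + r, lam)
```

As a quick sanity check, with `y = [1, 0]`, `r = [0, 1]` and `lam = 1`, the projection of `y + r = [1, 1]` onto the simplex is `[0.5, 0.5]`, hence the prox equals `[0.5, -0.5]`.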
The iterations associated with Problem (7) are summarized in Algorithm 1, where the sequence
$(x^{[i]})_{i\in\mathbb{N}}$ is guaranteed to converge to a solution to Problem (7), provided that such a solution exists
[19, 20]. In Algorithm 1, we use the notation

$$T = [T_1^\top \; \dots \; T_L^\top]^\top, \qquad r = [r_1^\top \; \dots \; r_L^\top]^\top.$$

Remark 3.2. Algorithm 1 allows us to solve Problem (7) together with its (Fenchel-Rockafellar)
dual formulation

$$\underset{y \in \mathbb{R}^{KL}}{\operatorname{minimize}} \;\; g^*(-T^\top y) - y^\top r \quad \text{s.t.} \quad y \in (S_\lambda)^L, \tag{11}$$

where $g^*$ is the convex conjugate of $g$. In the case when $g = (1/2)\|\cdot\|_2^2$, the primal and dual solutions
are linked by $x = -T^\top y$, and thus Problem (11) reduces to the (Lagrangian) dual formulation of
Problem (4) used in standard SVMs.


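Algorithm 1 FBPD for solving Problem (7)
Initialization
    choose $x^{[0]} \in \mathbb{R}^{(M+1)K}$
    choose $y^{[0]} \in \mathbb{R}^{KL}$
    set $\tau > 0$ and $\sigma > 0$ such that $\tau\sigma\|T\|^2 < 1$.
For $i = 0, 1, \dots$
    $x^{[i+1]} = \operatorname{prox}_{\tau g}\big(x^{[i]} - \tau\, T^\top y^{[i]}\big)$
    $y^{[i+1]} = P_{(S_\lambda)^L}\big(y^{[i]} + \sigma\, T(2x^{[i+1]} - x^{[i]}) + \sigma\, r\big)$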
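The iterations of Algorithm 1 can be sketched as follows for the particular choice of $g$ equal to the $\ell_1$-norm of the weights (offsets left unpenalized). This is a minimal illustration under several assumptions that are not taken from the text: the last column of `Phi` is the constant 1 of the augmented features, labels are 0-based, a single margin `mu` is used, a fixed iteration count replaces a stopping test, and $\|T\|^2$ is replaced by the crude upper bound $2(K+1)\|\Phi\|_2^2$. All names are hypothetical.

```python
import numpy as np

def project_simplex(v, lam):
    """Projection onto {u >= 0, sum(u) = lam} (sort-based)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - lam
    rho = np.nonzero(u - css / np.arange(1, len(v) + 1) > 0)[0][-1]
    return np.maximum(v - css[rho] / (rho + 1), 0.0)

def fbpd_l1_svm(Phi, z, K, lam=1.0, mu=1.0, n_iter=2000):
    """Primal-dual iterations for min ||w||_1 + lam * sum_l h_l(T_l x).

    Phi : (L, M+1) augmented features, z : (L,) labels in {0, ..., K-1}.
    Returns X of shape (K, M+1), whose row k is x^(k) = [w^(k); b^(k)].
    """
    L, D = Phi.shape
    idx = np.arange(L)
    R = np.full((L, K), mu)
    R[idx, z] = 0.0                                   # stacked vectors r_l

    def T(X):                                         # (K, D) -> (L, K)
        S = Phi @ X.T
        return S - S[idx, z][:, None]

    def Tt(Y):                                        # adjoint of T: (L, K) -> (K, D)
        Yc = Y.copy()
        Yc[idx, z] -= Y.sum(axis=1)
        return Yc.T @ Phi

    norm2 = 2.0 * (K + 1) * np.linalg.norm(Phi, 2) ** 2   # upper bound on ||T||^2
    tau = sigma = 0.99 / np.sqrt(norm2)                   # ensures tau*sigma*||T||^2 < 1
    X, Y = np.zeros((K, D)), np.zeros((L, K))
    for _ in range(n_iter):
        V = X - tau * Tt(Y)
        X_new = V.copy()                              # prox of tau*g: offsets untouched
        X_new[:, :-1] = np.sign(V[:, :-1]) * np.maximum(np.abs(V[:, :-1]) - tau, 0.0)
        Z = Y + sigma * T(2.0 * X_new - X) + sigma * R
        Y = np.vstack([project_simplex(row, lam) for row in Z])   # P onto (S_lam)^L
        X = X_new
    return X
```

The loose bound on $\|T\|^2$ keeps the sketch short at the price of smaller steps; a power iteration on `Tt(T(.))` would give a tighter estimate. Other regularizers only require changing the two lines that implement $\operatorname{prox}_{\tau g}$.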


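3.2 Constrained formulation


Problem (8) presents a more challenging computational issue, as the projection onto the hinge-loss
constraint set cannot be evaluated in closed form, and it would require solving a constrained
quadratic problem at each iteration. In order to manage this constraint, we propose to introduce a
vector of auxiliary variables $\zeta = (\zeta^{(\ell)})_{1\le \ell\le L} \in \mathbb{R}^L$ in the minimization process, so that Problem (8) can
be equivalently rewritten as

$$\underset{(x,\zeta) \in \mathbb{R}^{(M+1)K} \times \mathbb{R}^L}{\operatorname{minimize}} \;\; g(x) \quad \text{s.t.} \quad \begin{cases} \sum_{\ell=1}^{L} \zeta^{(\ell)} \le \eta, \\ (\forall \ell \in \{1,\dots,L\}) \quad h_\ell(T_\ell\, x) \le \zeta^{(\ell)}. \end{cases} \tag{12}$$

Interestingly, our approach is conceptually similar to adding the slack variables in (2), even though
our reformulation specifically aims at simplifying the way of solving the problem. Indeed, a possible
interpretation of Problem (12) is the following:

$$\underset{(x,\zeta) \in \mathbb{R}^{(M+1)K} \times \mathbb{R}^L}{\operatorname{minimize}} \;\; g(x) \quad \text{s.t.} \quad \begin{cases} (Tx, \zeta) \in E, \\ \zeta \in V_\eta, \end{cases} \tag{13}$$

where $E$ denotes the collection of epigraphs of $h_1, \dots, h_L$,

$$E = \big\{ (y,\zeta) \in \mathbb{R}^{KL} \times \mathbb{R}^L \;\big|\; (\forall \ell \in \{1,\dots,L\}) \;\; (y^{(\ell)}, \zeta^{(\ell)}) \in \operatorname{epi} h_\ell \big\},$$

and $V_\eta$ denotes a closed half-space

$$V_\eta = \Big\{ \zeta \in \mathbb{R}^L \;\Big|\; \sum_{\ell=1}^{L} \zeta^{(\ell)} \le \eta \Big\}.$$

The iterations related to Problem (13) are listed in Algorithm 2, where the sequence $(x^{[i]}, \zeta^{[i]})_{i\in\mathbb{N}}$ is
guaranteed to converge to a solution to (13), provided that such a solution exists.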


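Algorithm 2 FBPD for solving Problem (8)
Initialization
    choose $(x^{[0]}, \zeta^{[0]}) \in \mathbb{R}^{(M+1)K} \times \mathbb{R}^L$
    choose $(y^{[0]}, \xi^{[0]}) \in \mathbb{R}^{KL} \times \mathbb{R}^L$
    set $\tau > 0$ and $\sigma > 0$ such that $\tau\sigma\max\{\|T\|^2, 1\} < 1$.
For $i = 0, 1, \dots$
    $x^{[i+1]} = \operatorname{prox}_{\tau g}\big(x^{[i]} - \tau\, T^\top y^{[i]}\big)$
    $\zeta^{[i+1]} = P_{V_\eta}\big(\zeta^{[i]} - \tau\, \xi^{[i]}\big)$
    $\hat{y}^{[i]} = y^{[i]} + \sigma\, T(2x^{[i+1]} - x^{[i]})$
    $\hat{\xi}^{[i]} = \xi^{[i]} + \sigma\, (2\zeta^{[i+1]} - \zeta^{[i]})$
    $(p^{[i]}, \theta^{[i]}) = P_E\big(\hat{y}^{[i]}/\sigma,\; \hat{\xi}^{[i]}/\sigma\big)$
    $y^{[i+1]} = \hat{y}^{[i]} - \sigma\, p^{[i]}$
    $\xi^{[i+1]} = \hat{\xi}^{[i]} - \sigma\, \theta^{[i]}$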

The advantage of our approach lies in the fact that the projections onto $E$ and $V_\eta$ employed in
Algorithm 2 have closed-form expressions. Indeed, the projection onto $V_\eta$ is straightforward [43,
Section 6.2.3], while the projection onto $E$ can be block-decomposed as

$$P_E(y, \zeta) = \Big( P_{\operatorname{epi} h_\ell}\big(y^{(\ell)}, \zeta^{(\ell)}\big) \Big)_{1\le \ell\le L}, \tag{14}$$

where a closed-form expression of $P_{\operatorname{epi} h_\ell}$ with $\ell \in \{1,\dots,L\}$ is given in Proposition 3.3. The proof
of this new result follows the same line as the proof by [21, Proposition 5], where we derived the
epigraphical projection associated to the $\ell_\infty$-norm.

The decomposition in (14) yields two potential benefits. Firstly, the projection $P_{\operatorname{epi} h_\ell}$ is
computed onto the lower-dimensional convex subset $\operatorname{epi} h_\ell$ of $\mathbb{R}^K \times \mathbb{R}$, whose dimensionality is only
fixed by the number $K$ of classes. Secondly, these projections can be computed in parallel, since
they are defined over disjoint blocks whose number is given by the cardinality $L$ of the training set
(we refer to [51] for an example of parallel implementation on GP-GPUs).

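Proposition 3.3. For every $\ell \in \{1,\dots,L\}$, let $h_\ell$ be the function defined in (6) and, for every
$(y^{(\ell)}, \zeta^{(\ell)}) \in \mathbb{R}^K \times \mathbb{R}$, let $(\nu^{(\ell,k)})_{1\le k\le K}$ be the sequence $(y^{(\ell,k)} + r_\ell^{(k)})_{1\le k\le K}$ sorted in ascending order,³
and set $\nu^{(\ell,0)} = -\infty$ and $\nu^{(\ell,K+1)} = +\infty$. Then, $P_{\operatorname{epi} h_\ell}(y^{(\ell)}, \zeta^{(\ell)}) = (p^{(\ell)}, \theta^{(\ell)})$ with

$$p^{(\ell)} = \Big[ \min\big\{ y^{(\ell,k)},\; \theta^{(\ell)} - r_\ell^{(k)} \big\} \Big]_{1\le k\le K}$$

and

$$\theta^{(\ell)} = \frac{1}{K - \bar{k}^{(\ell)} + 2} \Big( \zeta^{(\ell)} + \sum_{k=\bar{k}^{(\ell)}}^{K} \nu^{(\ell,k)} \Big), \tag{15}$$

where $\bar{k}^{(\ell)}$ is the unique integer in $\{1,\dots,K+1\}$ such that

$$\nu^{(\ell,\bar{k}^{(\ell)}-1)} < \theta^{(\ell)} \le \nu^{(\ell,\bar{k}^{(\ell)})}, \tag{16}$$

with the convention $\sum_{k=K+1}^{K} \cdot = 0$.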
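The closed form in Proposition 3.3 translates into a short routine for a single block $\ell$. The following NumPy sketch sorts the shifted vector and scans the $K+1$ candidate intervals until condition (16) holds. It is an illustrative implementation with hypothetical names, written with 0-based indexing (`j` corresponds to $\bar{k}^{(\ell)} - 1$) and using a plain sort rather than the heap mentioned in the footnote.

```python
import numpy as np

def project_epi_hinge(y, zeta, r):
    """Project (y, zeta) onto epi h, where h(y) = max_k (y_k + r_k).

    Returns (p, theta) as in Proposition 3.3.
    """
    nu = np.sort(y + r)                            # nu^(1) <= ... <= nu^(K)
    K = len(nu)
    # tail[j] = nu[j] + ... + nu[K-1], with tail[K] = 0 (empty-sum convention)
    tail = np.concatenate([np.cumsum(nu[::-1])[::-1], [0.0]])
    lower = np.concatenate([[-np.inf], nu])        # nu^(kbar-1) for each candidate
    upper = np.concatenate([nu, [np.inf]])         # nu^(kbar)   for each candidate
    for j in range(K + 1):                         # j = kbar - 1
        theta = (zeta + tail[j]) / (K - j + 1)     # formula (15)
        if lower[j] < theta <= upper[j]:           # condition (16)
            break
    p = np.minimum(y, theta - r)
    return p, theta
```

If the pair already lies in the epigraph, the scan stops at the last candidate and the input is returned unchanged. Because the $L$ blocks are independent, the routine can be applied to each pair $(y^{(\ell)}, \zeta^{(\ell)})$ separately.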


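Proof. For every $(y^{(\ell)}, \zeta^{(\ell)}) \in \mathbb{R}^K \times \mathbb{R}$, $P_{\operatorname{epi} h_\ell}(y^{(\ell)}, \zeta^{(\ell)})$ denotes the unique solution to

$$\min_{(p^{(\ell)}, \theta^{(\ell)}) \in \operatorname{epi} h_\ell} \; \|p^{(\ell)} - y^{(\ell)}\|^2 + (\theta^{(\ell)} - \zeta^{(\ell)})^2,$$

which is equivalent to finding the minimizer of

$$\min_{\theta^{(\ell)} \in \mathbb{R}} \Big\{ (\theta^{(\ell)} - \zeta^{(\ell)})^2 + \min_{\substack{p^{(\ell,1)} \le \theta^{(\ell)} - r_\ell^{(1)} \\ \vdots \\ p^{(\ell,K)} \le \theta^{(\ell)} - r_\ell^{(K)}}} \|p^{(\ell)} - y^{(\ell)}\|^2 \Big\}. \tag{17}$$

For every $\theta^{(\ell)} \in \mathbb{R}$, the inner minimization is achieved when $p^{(\ell,k)} = \min\{y^{(\ell,k)}, \theta^{(\ell)} - r_\ell^{(k)}\}$ for each
$k \in \{1,\dots,K\}$, reducing (17) to

$$\min_{\theta^{(\ell)} \in \mathbb{R}} \Big\{ (\theta^{(\ell)} - \zeta^{(\ell)})^2 + \sum_{k=1}^{K} \big( \max\{ y^{(\ell,k)} + r_\ell^{(k)} - \theta^{(\ell)},\, 0 \} \big)^2 \Big\},$$

which achieves its minimum when $\theta^{(\ell)} = \operatorname{prox}_{\psi_\ell}(\zeta^{(\ell)})$, with

$$(\forall v \in \mathbb{R}) \qquad \psi_\ell(v) = \frac{1}{2} \sum_{k=1}^{K} \big( \max\{ \nu^{(\ell,k)} - v,\, 0 \} \big)^2. \tag{18}$$

The closed-form expression of this proximity operator, as well as the projection onto $\operatorname{epi} h_\ell$, are
derived in the following. In order to prove (15), we need to compute the proximity operator of $\psi_\ell$
defined in (18). Such a function belongs to $\Gamma_0(\mathbb{R})$ since, for each $k \in \{1,\dots,K\}$, $\max\{\nu^{(\ell,k)} - \cdot\,, 0\}$
is finite convex and $(\cdot)^2$ is finite convex and increasing on $[0,+\infty[$. In addition, $\psi_\ell$ is differentiable
and such that, for every $v \in \mathbb{R}$ and $k \in \{1,\dots,K+1\}$,

$$\nu^{(\ell,k-1)} < v \le \nu^{(\ell,k)} \quad \Rightarrow \quad \psi_\ell'(v) = \sum_{j=k}^{K} \big( v - \nu^{(\ell,j)} \big).$$

Therefore, there exists a $\bar{k}^{(\ell)} \in \{1,\dots,K+1\}$ such that $\nu^{(\ell,\bar{k}^{(\ell)}-1)} < \theta^{(\ell)} \le \nu^{(\ell,\bar{k}^{(\ell)})}$, which yields
(16). Moreover, by the definition of proximity operator, $\theta^{(\ell)} = \operatorname{prox}_{\psi_\ell}(\zeta^{(\ell)})$ is uniquely defined by
$\zeta^{(\ell)} - \theta^{(\ell)} = \psi_\ell'(\theta^{(\ell)})$, yielding

$$\zeta^{(\ell)} - \theta^{(\ell)} = \sum_{k=\bar{k}^{(\ell)}}^{K} \big( \theta^{(\ell)} - \nu^{(\ell,k)} \big),$$

which is equivalent to (15). The uniqueness of $\bar{k}^{(\ell)}$ follows from that of $\operatorname{prox}_{\psi_\ell}(\zeta^{(\ell)})$. □

³Note that the expensive sorting operation can be avoided by using a heap data structure [52], which keeps a
partially-sorted sequence such that the first element is the largest. This approach was used, e.g., by van den Berg et
al. [53, Algorithm 2] for implementing the projection onto the $\ell_1$-ball.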


4 Numerical results 


In this section, we numerically evaluate the performance of sparse multiclass SVM w.r.t. the three
following databases.

• Leukemia database. The first experiment concerns the classification of microarray data.
The considered database contains 72 samples of $N = M = 7129$ gene expression levels (so that
$\phi(u) = u$) measured from patients having $K = 3$ types of leukemia disease [54]. The database
is usually organized in $L = 38$ training samples and 34 test samples.⁴ In our experiments, we
used blocks of 5 genes for the mixed-norm regularization.

• MNIST database. The second experiment concerns the classification of handwritten digits.
More precisely, we consider the MNIST database [55], which contains a number of $28 \times 28$
grayscale images ($N = 784$) displaying digits from 0 to 9 ($K = 10$). The database is organized
in 60000 training images and 10000 test images.⁵ In our experiments, we defined the mapping
$\phi$ by resorting to the scattering convolution network [56] with $m = 2$ wavelet layers scaled
up to $2^J = 4$, which transforms an input image of size $28 \times 28$ into 81 images of size $14 \times 14$
(thus $M = 15876$). For the regularization, we used the $\ell_{1,\infty}$-norm by dividing each vector
$(x^{(k)})_{1\le k\le K}$ into $14^2$ blocks of size 81. Moreover, in order to evaluate the performance, we
trained a classifier on 25 different training subsets of size $L \in \{3K, 5K, 10K\}$, we computed
the classification errors by evaluating the 25 trained classifiers on the whole test set, and we
averaged the resulting errors.

• News20 database. The third experiment concerns the classification of text documents into
a fixed number of predefined categories. More precisely, we consider the News20 database
[57], which contains a number of documents partitioned across $K = 20$ different newsgroups.
The database is organized in 11314 training documents and 7532 test documents.⁶ In our
experiments, we defined the mapping $\phi$ by resorting to the term frequency - inverse document
frequency transformation [58], yielding $M = 26214$. For the regularization, we used the $\ell_{1,2}$-norm
in the same way as [38]. Moreover, in order to evaluate the performance, we trained a classifier
on 10 different training subsets of size $L \in \{5K, 10K, 50K\}$, we computed the classification
errors by evaluating the 10 trained classifiers on the whole test set, and we averaged the
resulting errors.

⁴Data available at www.broadinstitute.org/cancer/software/genepattern/datasets
⁵Data available at http://yann.lecun.com/exdb/mnist


4.1 Assessment of classification accuracy 

In this section, we evaluate the classification errors obtained with the sparse multiclass SVM
formulated in Problems (7)-(8). Our objective here is to show that the exact hinge loss allows us to
achieve better performance than its approximated smooth versions, especially with few training
data. Hence, we compare the proposed method with the following approaches:

• the multiclass SVM proposed by [38]

$$\underset{x \in \mathbb{R}^{(M+1)K}}{\operatorname{minimize}} \;\; g(x) + \lambda \sum_{\ell=1}^{L} \sum_{k \neq z_\ell} \Big( \max\big\{0,\; \mu_\ell + \varphi(u_\ell)^\top \big(x^{(k)} - x^{(z_\ell)}\big) \big\} \Big)^2, \tag{19}$$

• the multinomial logistic regression

$$\underset{x \in \mathbb{R}^{(M+1)K}}{\operatorname{minimize}} \;\; g(x) + \lambda \sum_{\ell=1}^{L} \log\Big( 1 + \sum_{k \neq z_\ell} \exp\big( \mu_\ell + \varphi(u_\ell)^\top (x^{(k)} - x^{(z_\ell)}) \big) \Big), \tag{20}$$

• the binary SVM by [30] based on the “one-vs-all” strategy, which aims, for every $k \in
\{1,\dots,K\}$, to

$$\underset{x^{(k)} \in \mathbb{R}^{M+1}}{\operatorname{minimize}} \;\; g(x^{(k)}) + \lambda \sum_{\ell=1}^{L} \Big( \max\big\{0,\; \mu_\ell - z_\ell^{(k)}\, \varphi(u_\ell)^\top x^{(k)} \big\} \Big)^2, \tag{21}$$

with $z_\ell^{(k)}$ being equal to $1$ if $z_\ell = k$, and $-1$ otherwise. Note that (21) may be seen as a special
case of (19).

In the following, we refer to Problems (7)-(8) as hinge, and to Problems (19)-(21), respectively,
as square, logit, and one-vs-all. Since the parameters $\lambda$ and $\eta$ need to be estimated (e.g., through
cross validation), it is important to evaluate the impact of their choice on the performance, although
it is out of the scope of this paper to devise an optimal strategy to set them. To compare the
above methods for different choices of these parameters, we set $\lambda = \alpha^{-1}$ or $\eta = \alpha L$, by varying $\alpha$
inside a fixed set of predefined values. We also follow the usual convention of setting $\mu_\ell = 1$.

⁶Data available at www.cad.zju.edu.cn/home/dengcai/Data/TextData.html
Table 1: Comparisons on the leukemia database. For each method, the first column reports the test errors and the
second one the number of non-zero coefficients (one count per class).

| $g(x)$ | HINGE errors | HINGE non-zero coeff. | SQUARE errors | SQUARE non-zero coeff. | LOGIT errors | LOGIT non-zero coeff. | ONE-VS-ALL errors | ONE-VS-ALL non-zero coeff. |
|---|---|---|---|---|---|---|---|---|
| $\ell_2$ | 1/34 | 7129 + 7129 + 7129 | 2/34 | 7129 + 7129 + 7129 | 1/34 | 7129 + 7129 + 7129 | 2/34 | 7129 + 7129 + 7129 |
| $\ell_1$ | 2/34 | 13 + 3 + 10 | 3/34 | 8 + 3 + 8 | 3/34 | 18 + 5 + 14 | 3/34 | 19 + 8 + 15 |
| $\ell_{1,2}$ | 0/34 | 95 + 5 + 75 | 1/34 | 55 + 5 + 45 | 0/34 | 50 + 5 + 35 | 1/34 | 70 + 10 + 50 |
| $\ell_{1,\infty}$ | 0/34 | 50 + 5 + 45 | 0/34 | 35 + 5 + 35 | 0/34 | 50 + 5 + 40 | 0/34 | 45 + 5 + 45 |

• Leukemia database. Table 1 reports the classification errors, as well as the number of
non-zero coefficients in vectors $(x^{(k)})_{1\le k\le 3}$, obtained with hinge, square, logit, and one-vs-all
using various regularization terms. For each method, we set $\alpha$ to the value yielding the
best accuracy (by using a simple trial-and-error strategy). The results indicate that sparse
regularization allows us to effectively select a small set of important features for each prediction
vector $(x^{(k)})_{1\le k\le 3}$, with better results than the quadratic regularization. In addition, the
classification errors show that hinge is often more accurate than square.

• MNIST database. Figures 1a, 1c and 1e report the classification errors as a function of the
regularization hyperparameter. These results were obtained with the $\ell_{1,\infty}$-norm regularization,
as it was the one leading to the best results in all our experiments on this database. The
classification errors indicate that the hinge approach is slightly more accurate than the other
ones. On the other side, Figures 1b, 1d and 1f report the percentage of zero coefficients
in vectors $(x^{(k)})_{1\le k\le K}$ as a function of $\alpha$. The plots show that the hinge approach yields
solutions slightly more sparse than the other ones.

• News20 database. Figures 2a, 2c and 2e report the classification errors (as a function
of the regularization hyperparameter) obtained by using the $\ell_{1,2}$-norm regularization. The
classification errors indicate that the hinge approach is slightly more accurate than the square
approach. The plots also show that the results obtained with the hinge approach are more
robust w.r.t. the choice of the regularization parameter. On the other side, Figures 2b, 2d
and 2f report the percentage of zero coefficients in vectors $(x^{(k)})_{1\le k\le K}$ as a function of $\alpha$.
The plots show that the hinge approach yields solutions as sparse as the square approach.
Figure 1: Results on MNIST database with the $\ell_{1,\infty}$-regularization for $L \in \{3K, 5K, 10K\}$. Left
column: classification errors as a function of $\alpha$. Right column: percentage of zero coefficients in
vectors $(x^{(k)})_{1\le k\le K}$ as a function of $\alpha$. The circles mark the values yielding the best accuracy.

Figure 2: Results on News20 database with the $\ell_{1,2}$-regularization for $L \in \{5K, 10K, 50K\}$. Left
column: classification errors as a function of $\alpha$. Right column: percentage of zero coefficients in
vectors $(x^{(k)})_{1\le k\le K}$ as a function of $\alpha$. The circles mark the values yielding the best accuracy.
4.2 Assessment of execution times


In this section, we compare the execution times of Algorithms 1 and 2 with⁷

• a FISTA implementation of Problem (19),

• a forward-backward implementation of Problem (20),

• a FBPD implementation of Problem (12) reformulated with linear constraints

$$\underset{x,\, (\zeta^{(\ell,k)})_{\ell,k}}{\operatorname{minimize}} \;\; g(x) \quad \text{s.t.} \quad \begin{cases} \sum_{\ell=1}^{L} \sum_{k=1}^{K} \zeta^{(\ell,k)} \le K\eta, \\ (\forall \ell \in \{1,\dots,L\}) \quad \zeta^{(\ell,1)} = \dots = \zeta^{(\ell,K)}, \\ (\forall \ell \in \{1,\dots,L\}) \quad \zeta^{(\ell,1)} \ge 0, \dots, \zeta^{(\ell,K)} \ge 0, \\ (\forall \ell \in \{1,\dots,L\}) \quad T_\ell\, x + r_\ell - (\zeta^{(\ell,k)})_{1\le k\le K} \le 0. \end{cases}$$

This approach is conceptually similar to the linear programming methods proposed by [33]
and [37] for $\ell_1$- or $\ell_{1,\infty}$-regularized SVMs.

Figures 3a, 3c and 3e show the execution times (averaged among 10 training sets) obtained by the
above algorithms for various values of $\lambda$ and $\eta$ on the MNIST database with $L \in \{3K, 5K, 10K\}$.
In this experiment, the execution times refer to a fixed stopping criterion on the relative error
between two consecutive iterates. Conversely, Figures 3b, 3d and 3f show the relative distance
$\|x^{[i]} - x^{[\infty]}\| / \|x^{[\infty]}\|$ (as a function of time) for the values of $\lambda$ and $\eta$ yielding the best accuracy (as
reported in Figure 1), where $x^{[\infty]}$ denotes the solution computed with a stricter stopping criterion.
These results demonstrate that the proposed algorithms are faster than the approaches based on
linear constraints and logistic regression, while being comparable in terms of execution times to
approaches based on the square hinge loss. In addition, Algorithm 2 turns out to converge faster
than Algorithm 1. This can be explained by the higher computational cost of the projection onto
the standard simplex.


4.3 Quadratic regularization 

Although our emphasis is on sparse learning, we propose to complete our analysis by evaluating
the efficiency of the proposed algorithms in the case when $g$ is a quadratic regularization function.
To this end, we compare the execution times of Algorithms 1 and 2 with the SVM-struct algorithm
proposed by [6], which provides a numerical approach for solving Problem (4) through a
cutting-plane technique. Figure 4 reports the execution times (averaged on 10 training sets) obtained
by the above methods on the MNIST database with $L \in \{3K, 5K, 10K, 50K, 100K, 500K\}$ and
different values of $\alpha$. In this experiment, we set the same stopping criterion in all methods, and
the regularization parameter of SVM-struct to $L/\alpha$. The results show that the proposed algorithms
are competitive with state-of-the-art solutions in scenarios with a limited number of training data.
The same cannot be claimed for larger databases, as SVM-struct scales particularly well w.r.t. the
number $M$ of features and the size $L$ of the training set. Note however that, when $L/K = 500$,
the number of significant features for the SVM classifier designed with a quadratic regularization
is equal to $KM - 546 = 158214$ (after thresholding negligible coefficients), while a sparse approach using an
$\ell_{1,\infty}$-norm regularization yields only 42795 nonzero features.

⁷The codes were implemented in MATLAB and executed on an Intel CPU at 3.33 GHz with 24 GB of RAM.

Figure 3: Results on MNIST database with the $\ell_{1,\infty}$-regularization for $L \in \{3K, 5K, 10K\}$. Left
column: execution time as a function of $\alpha$, where the circles mark the values yielding the best
accuracy (as reported in Figure 1). Right column: distance to $x^{[\infty]}$ (as a function of time) obtained
with the values of $\alpha$ marked by a circle in the left column (note that the one-vs-all approach, being
defined by multiple optimization problems, does not allow us to determine the iterate at each
iteration, hence the associated plot cannot be traced).


5 Conclusions 


We have proposed two efficient algorithms for learning a sparse multiclass SVM. Our approach 
makes it possible to minimize a criterion involving the multiclass hinge loss and a sparsity-inducing 
regularization. In the literature, such a criterion is typically approximated by replacing the hinge 
loss with a smooth penalty, such as the quadratic hinge loss or the logistic loss. In this paper, we 
have provided two solutions that directly deal with the hinge loss: one addressing the regularized 
formulation and the other one adapted to the constrained formulation. The performance of the
proposed solutions has been evaluated over three databases in scenarios with few training data.
The results show that the use of the hinge loss, rather than an approximation, leads to a slightly
better classification accuracy and tends to make the method more robust w.r.t. the choice of
the regularization parameter, while the proposed algorithms are often faster than state-of-the-art
solutions.